Abstract

CVL is a library of low-level vector routines callable from C. This library includes a wide variety
of vector operations such as elementwise function applications, scans, reduces and permutations.
Most CVL routines are defined for segmented and unsegmented vectors. This paper is intended for
CVL users and implementors, and assumes familiarity with vector operations and the scan-vector
model of parallel computation.
1 Introduction


CVL is a library of low-level vector routines callable from C. This library presents an abstract
model of a vector machine suitable either for stand-alone use or as the backend of a high-level
language system. CVL includes a rich set of vector operations, including both elementwise
computations and more global operations such as scans, reductions, and permutations. The library
also includes segmented versions of these global operations; segmented operations are crucial for
the implementation of nested data-parallel languages [1, 7, 4].

The vector machine model provided by CVL is very low level and was designed so that efficient
versions of the library could be developed for a wide variety of parallel architectures. Currently,
optimized versions of the library are available for the Connection Machine CM-2 and CM-5, and
the Cray Y-MP and Y-MP/C90. We and others are developing versions for the MasPar MP-1 [10]
and for a network of workstations communicating with PVM [11]. There is also a portable
serial version of the library. Many of the primitives provided by the library (in particular the scan
operations) are much faster than could be easily achieved with Fortran or C implementations on
these machines. This allows CVL to be used as an efficient stand-alone library.

CVL was designed to provide a vector abstraction that can be used for the implementation of
higher-level data-parallel languages. The authors have designed and implemented the nested
data-parallel language NESL [2]. This language compiles into an intermediate language, VCODE [3, 6],
which is then interpreted. The interpreter uses CVL to implement the required vector routines. In
addition to our work, a research group at Linköping University in Sweden has targeted VCODE and
CVL as compiler backends for the Predula parallel language [9], and researchers at the University of
North Carolina, Chapel Hill, have targeted CVL for the data-parallel portion of the Proteus [12, 13]
language.

This paper is intended for CVL users and implementors. We also assume familiarity with vector
operations and the scan-vector model of parallel computation [1]. Because of its intended audience
and the low-level nature of the CVL abstractions, this paper has lots of grungy details that would
probably not be interesting to a casual reader. We strongly urge the reader to read the NESL
implementation paper [4] in order to understand the context in which CVL was designed. A
high-level description of the vector operations provided by CVL can be found either in Blelloch's
thesis [1] or in the published descriptions of VCODE [3, 6]. Low-level details on how scans and
segmented operations can be efficiently implemented on vector machines are also available [8].


2 CVL




CVL is a C library implementing a variety of vector operations on elements of a homogeneous vector
memory. This vector memory should be viewed as distinct from the standard C heap or stack, and
should only be accessed and modified through the CVL routines.

The CVL memory model distinguishes between vector length and vector size. Vector length is
the number of elements in a vector; vector size is the number of "units of the vector memory"
occupied by a vector. The vector memory unit is an implementation-dependent abstract quantity
and may indicate nothing about the number of bytes or words taken up by a vector. In most
implementations it refers to the amount of memory used per processor to hold the vector. An
integer vector of length 1000, for example, might only require 4 units of vector memory in one CVL
implementation and 500 in another.

CVL provides functions that map the length of a vector to its size; there is one such function
for each element type. For example, siz_foz(len) returns the size of a vector of integers, given
the vector's length. The inverse mapping is not supplied and need not be unique: many vector
lengths might map to a single vector size. Both vector length and vector size have C type int.

This distinction between vector length and size, and their dependence on type, give added
flexibility to the CVL implementor: they allow different mappings of data on multiprocessor machines,
allow boolean vectors to be packed into bit vectors, and allow an implementation to pad vectors
to larger sizes.
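To make the length/size distinction concrete, here is a minimal sketch of how a size function might be defined if one memory unit means "one element slot per processor". The names siz_foz_sketch and siz_fob_sketch, and the parameters num_procs and bits_per_unit, are hypothetical and not part of CVL; a real implementation is free to choose any other mapping.

```c
#include <assert.h>

/* Hypothetical sketch: vector size as the number of memory units needed
 * per processor when elements are spread evenly over num_procs processors.
 * This is one possible mapping, not the one any particular CVL port uses. */
int siz_foz_sketch(int len, int num_procs)
{
    assert(len >= 0 && num_procs > 0);
    /* Ceiling division: each processor holds at most this many elements. */
    return (len + num_procs - 1) / num_procs;
}

/* Hypothetical sketch for booleans packed bits_per_unit to a memory unit. */
int siz_fob_sketch(int len, int num_procs, int bits_per_unit)
{
    assert(bits_per_unit > 0);
    int per_proc = siz_foz_sketch(len, num_procs);
    return (per_proc + bits_per_unit - 1) / bits_per_unit;
}
```

Under these assumptions an integer vector of length 1000 occupies 4 units on 256 processors and 500 units on 2 processors, and lengths 999 and 1000 map to the same size on 256 processors, which is why no inverse mapping can be supplied.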

CVL defines the C type vec_p as an abstract handle for accessing vector memory. For each
vector in memory, there is a vec_p, and all references to that vector must be made through that
vec_p. A vec_p should be thought of as a pointer into vector memory, but its realization may
be more complicated. CVL defines an interface for manipulating and performing the equivalent of
pointer arithmetic on a vec_p. For example, given a vec_p int_vec corresponding to an integer
vector of length len, the call

    vec_p new_vec = add_fov(int_vec, siz_foz(len));

gets a new vec_p, new_vec, that corresponds to a block of memory guaranteed not to overlap
int_vec in vector memory. This is intended to be reminiscent of adding an offset to a pointer.
Note that the offset argument to the vec_p arithmetic functions must be based on vector size, not
vector length. Similarly, there is a function sub_fov(vec_p1, vec_p2) that corresponds to pointer
subtraction and returns the size of the largest possible vector that has vec_p1 as a handle and does
not overlap a vector with handle vec_p2.

Vector memory is allocated using the alo_foz(size) CVL function. This returns a vec_p for a
block of vector memory of the requested size. If there is an error, NULL is returned. Vector memory
is deallocated using fre_fov(vec_p); the argument to the free function must have been the result
of a call to alo_foz. The current CVL specification allows only a single block of memory to be
active during program execution; a program using CVL can only have a single call to alo_foz.
Multiple calls to alo_foz have undefined, implementation-dependent semantics.

CVL provides no memory management facilities other than the allocate/free functions just
described and the vec_p arithmetic functions. It is the responsibility of programs using CVL to
break up the block returned by alo_foz into pieces for storing individual vectors. One way of
doing this, used in an interpreter for VCODE, is described in [5].
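As an illustration of what "breaking up the block" can look like, the sketch below carves a single block into pieces with a simple bump pointer. It is not the scheme described in [5]. The type arena_t and the function arena_take are hypothetical, and vec_p and add_fov are toy stand-ins (a plain int pointer and pointer addition) so that the fragment compiles without CVL. It also assumes that offsets passed to add_fov accumulate the way pointer offsets do.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, for illustration only: a vec_p is a plain pointer and
 * one unit of vector memory holds one int. */
typedef int *vec_p;
static vec_p add_fov(vec_p v, int a) { return v + a; }

/* Hypothetical bookkeeping for the single block obtained from alo_foz. */
typedef struct {
    vec_p base;      /* handle for the start of the block */
    int   capacity;  /* size of the block, in vector memory units */
    int   used;      /* units handed out so far */
} arena_t;

/* Hand out the next `size` units, or NULL if the block is exhausted.
 * The caller computes size with the appropriate siz_ function. */
vec_p arena_take(arena_t *a, int size)
{
    assert(size >= 0);
    if (size > a->capacity - a->used)
        return NULL;                       /* CVL itself would not check this */
    vec_p v = add_fov(a->base, a->used);   /* skip everything handed out so far */
    a->used += size;
    return v;
}
```

Nothing is ever returned to the arena in this sketch; a real client such as an interpreter would also need a way to reclaim or compact space.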

CVL instructions generally take handles to all their source and destination vectors. In addition,
there is a length argument for each vector (except when several arguments are required to have
the same length), and a segment descriptor argument for each segmented operand. Almost all CVL
functions take as a final argument a vec_p for a scratch space which the function may use for
storage of intermediate vectors (see below for an example of this). Thus, the function add_wuz()
(which adds corresponding elements of two integer vectors) has the prototype:

    void add_wuz(vec_p dest, vec_p src1, vec_p src2, int len, vec_p scratch)

For each CVL vector function foo_bar, there is a function foo_bar_scratch(len) (or, for segmented
CVL instructions, foo_bar_scratch(len, num_segs)) which returns the amount of scratch space
needed by that function on vectors of length len (with num_segs segments). This scratch area is
required because, for some instructions on some architectures, the amount of extra space required
may depend on vector length or size. For example, an implementation of the pack function may
build an internal index vector; this vector would be put in the scratch space.

Let us take a look at a CVL fragment that puts all this together. Suppose that we have two
integer vectors, a and b of length len, allocated at the start of vector memory. We wish to put the
elementwise sum of these vectors into a new vector, allocated after b in memory:

    vec_p add_vectors(vec_p a, vec_p b, int len) {
        vec_p sum = add_fov(b, siz_foz(len));
        vec_p scratch = add_fov(sum, siz_foz(len));
        add_wuz(sum, a, b, len, scratch);
        return sum;
    }

First we use the vec_p addition function to get handles to two distinct regions of memory after b
that can contain the sum and the scratch area. Then we use these as arguments to the add_wuz
function, which does all the work. More complete CVL code is given in Appendix B.

This function assumes that there is enough memory available to store the result and scratch
vectors. There is no requirement that add_fov do any error checking: the result of the call may
be an illegal handle. In general, CVL does no error checking. It is the responsibility of the calling
program to manage vector memory and make sure that enough space is available.
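Since the caller is responsible for checking space, one possible shape for a checked variant of add_vectors is sketched below. The function add_vectors_checked and its limit argument (a handle marking the end of the allocated block) are hypothetical. The definitions of vec_p, siz_foz, add_fov, cmp_fov, add_wuz and add_wuz_scratch are toy serial stand-ins written only so that the fragment compiles on its own; they are not the library's implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Toy serial stand-ins so the sketch compiles without CVL. Here a vec_p
 * is a plain int pointer and one memory unit holds one int. */
typedef int *vec_p;
static int   siz_foz(int len)            { return len; }
static int   add_wuz_scratch(int len)    { (void)len; return 0; }
static vec_p add_fov(vec_p v, int a)     { return v + a; }
static int   cmp_fov(vec_p v1, vec_p v2) { return (v1 > v2) - (v1 < v2); }
static void  add_wuz(vec_p d, vec_p s1, vec_p s2, int len, vec_p scratch)
{
    (void)scratch;                        /* this toy version needs no scratch */
    for (int i = 0; i < len; i++)
        d[i] = s1[i] + s2[i];
}

/* Like add_vectors, but refuses to run past `limit`, a handle marking the
 * end of the allocated block. Returns NULL when space is insufficient. */
vec_p add_vectors_checked(vec_p a, vec_p b, int len, vec_p limit)
{
    assert(len >= 0);
    vec_p sum     = add_fov(b, siz_foz(len));
    vec_p scratch = add_fov(sum, siz_foz(len));
    vec_p end     = add_fov(scratch, add_wuz_scratch(len));
    if (cmp_fov(end, limit) > 0)
        return NULL;                      /* result or scratch would overrun */
    add_wuz(sum, a, b, len, scratch);
    return sum;
}
```

The sketch asks add_wuz_scratch for the scratch requirement instead of assuming it is zero, so the same check remains valid on an implementation where the scratch area is not empty.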

3 CVL Data Types

All CVL instructions operate on homogeneous vector arguments whose elements are of a specific
type. For example, there are separate elementwise addition functions for vectors of doubles and
for vectors of integers. The allowed element types are int, double, and cvl_bool. Integers and
doubles are the standard (machine-dependent) size and precision; booleans might be stored as
words, chars, bits, etc., depending on time and space tradeoffs. It is the responsibility of the calling
program to use the CVL function that corresponds to the type of the elements of a vector. CVL is
low-level and has (at best) minimal error detection; failure to use the correct function may lead to
unpredictable results or compile- or run-time failures.

There are two varieties of CVL vectors: segmented and unsegmented [1]. An unsegmented
vector is a standard vector: a one-dimensional data structure containing elements of the same
type. A segmented vector is a data structure consisting of a group of vectors of the same element
type. Segmented vectors have the property that a function applied to a segmented vector applies
to each of its segments independently. For example, a plus reduction of a segmented vector will
sum each segment:

    segmented-plus-reduce [[7 2 9] [8 4 5 6 3]] = [18 26].

The CVL implementation of a segmented vector is an unsegmented vector together with a segment
descriptor that describes how to partition the vector into sub-vectors. Most CVL functions
on unsegmented vectors have a counterpart for segmented vectors. These functions perform the
unsegmented function on each sub-vector and package the result into a new segmented vector.

Segment descriptors are stored in the vector memory along with the vectors. There are two
length quantities associated with each segment descriptor: the number of segments and the total
number of elements in the vector. For example, the segmented vector

    s = [[7 2 9] [8 4 5 6 3]]

has 2 segments and 8 elements. The segmented vector s would be represented as an unsegmented
vector of eight elements

    v = [7 2 9 8 4 5 6 3]

and a segment descriptor that describes the partitioning. One concrete representation of a segment
descriptor is a vector of segment lengths:

    d = [3 5];

another is a vector of the indices of the initial element of each segment:

    d' = [0 3].

The internal representation of a segment descriptor is implementation-dependent; the only
restriction is that vectors and segment descriptors are both handled by objects of C type vec_p.
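The two descriptor representations above are related by an exclusive plus-scan of the segment lengths. The following sketch works on ordinary C arrays, not on CVL vectors: it converts lengths to start indices and gives a serial reference for the segmented plus-reduce example. The function names lengths_to_starts and seg_plus_reduce are hypothetical and are not CVL routines.

```c
#include <assert.h>

/* Convert segment lengths (d = [3 5]) into start indices (d' = [0 3]).
 * This is an exclusive plus-scan over the lengths. Returns the total
 * number of elements, which should equal the length of the flat vector. */
int lengths_to_starts(const int *lengths, int seg_count, int *starts)
{
    int total = 0;
    for (int s = 0; s < seg_count; s++) {
        assert(lengths[s] >= 0);
        starts[s] = total;
        total += lengths[s];
    }
    return total;
}

/* Serial reference for a segmented plus-reduce: one sum per segment. */
void seg_plus_reduce(int *dest, const int *src,
                     const int *lengths, int seg_count)
{
    int pos = 0;
    for (int s = 0; s < seg_count; s++) {
        int sum = 0;               /* identity for addition; covers empty segments */
        for (int k = 0; k < lengths[s]; k++)
            sum += src[pos++];
        dest[s] = sum;
    }
}
```

Applied to v = [7 2 9 8 4 5 6 3] with d = [3 5], the first function yields [0 3] and a total of 8, and the second yields [18 26].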

Since not all CVL functions need segmentation information, unsegmented CVL functions (an
elementwise operation, for example) may take a segmented vector as an argument and operate
on that vector without regard to segmentation.

All segmented functions must be supplied with a segment descriptor and both length quantities
as arguments. For example, the prototype for the segmented plus scan operation is:

    void add_sez (vec_p dest, vec_p src, vec_p segd, int vec_len,
                  int seg_count, vec_p scratch)

A segment descriptor can only be created with the

    mke_fov (vec_p out_segd, vec_p lengths, int vec_len, int seg_count,
             vec_p scratch)

function; lengths is a vector of length seg_count containing segment lengths and vec_len is the
number of elements in the vector; the segment descriptor is returned in out_segd. The size of a
segment descriptor is determined with the function

    siz_fos(int vec_len, int seg_count).

4 Instruction Classes

This section classifies the instructions supplied by CVL. Full descriptions of each routine and its
arguments are given in Appendix A.

CVL functions are named according to a strict naming convention for easy recall. The name of
each CVL function has four components. The first is a three-letter mnemonic for the root function
being applied: e.g., add, sub, lsh (left shift), sel (select). The second field is a consonant denoting
a modifier for the function that explains how it is used: e.g., r (reduce), s (scan), p (permute).
The next field is a vowel indicating the kind of vector to which the function is to be applied: u
(unsegmented), e (segmented), and o (none or scalar). The final field is a consonant specifying the
type of the elements of the vector: e.g., b (boolean), z (integer), s (segment descriptor). The first
field is separated from the next three by an underscore. Table 1 gives the complete list of modifiers.
Mix and match name fields for the function you need.^1
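Because the convention is strictly positional, a name can be taken apart mechanically. The sketch below is only an illustration of the scheme: cvl_name_t and parse_cvl_name are hypothetical helpers, not part of CVL, and the accepted field letters are those listed in Table 1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record of the four name fields. */
typedef struct {
    char root[4];   /* three-letter mnemonic, e.g. "add" */
    char fun_type;  /* w, s, r, p, v, f, or l */
    char vec_type;  /* u, e, or o */
    char elt_type;  /* b, z, d, s, or v */
} cvl_name_t;

/* Returns 1 and fills *out if name has the form rrr_xyz with valid field
 * letters, and 0 otherwise. */
int parse_cvl_name(const char *name, cvl_name_t *out)
{
    assert(name != NULL && out != NULL);
    if (strlen(name) != 7 || name[3] != '_')
        return 0;
    if (!strchr("wsrpvfl", name[4])) return 0;   /* fun_type */
    if (!strchr("ueo", name[5]))     return 0;   /* vec_type */
    if (!strchr("bzdsv", name[6]))   return 0;   /* element_type */
    memcpy(out->root, name, 3);
    out->root[3] = '\0';
    out->fun_type = name[4];
    out->vec_type = name[5];
    out->elt_type = name[6];
    return 1;
}
```

For example, "add_wuz" decodes as root add, elementwise, unsegmented, integer. Auxiliary names with a trailing suffix are deliberately rejected by this sketch.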

For each CVL function (with the exception of a few of the facility functions) there are two
auxiliary functions with names ending in the extensions _scratch and _inplace. The scratch
functions were discussed earlier and return the amount of scratch space required by a function, for
input of a given size. To provide better memory reuse and locality, CVL provides inplace auxiliary
functions for most CVL functions. These functions indicate which arguments of a CVL function
may be used destructively; in other words, which source vectors can be overwritten to form the
destination vector. These inplace functions return the bitwise or of the appropriate subset of the
values

    INPLACE_NONE, INPLACE_1, INPLACE_2


^1 The consonant-vowel-consonant scheme for suffixes leads to pronounceable function names. You too
will soon have "add_wuz" and "len_fos" rolling off your tongue like a pro!
function name = fun_[fun_type][vec_type][element_type]

fun_type            vec_type            element_type
w  elementwise      u  unsegmented      b  boolean
s  scan             e  segmented        z  integer
r  reduce           o  none (scalar)    d  double
p  permute                              s  segment-descriptor
v  vector-scalar                        v  vec_p
f  facilities
l  library

Table 1: Fields for CVL function names.


which must be defined in cvl.h. These values are of type unsigned int. For example, the second
source vector of an elementwise integer add can be overwritten only if

    add_wuz_inplace() & INPLACE_2

evaluates to a nonzero value.
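A caller might use the inplace mask as in the following sketch when deciding whether a source buffer can double as the destination. The numeric values given to the three constants here are assumptions made only so the fragment compiles; the report says only that they are unsigned int values defined in cvl.h. The helper choose_inplace_dest is hypothetical.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define INPLACE_NONE 0u
#define INPLACE_1    1u
#define INPLACE_2    2u

/* Decide where a two-source operation should write its result.
 * mask is the value returned by the operation's _inplace function;
 * src1_dead and src2_dead say whether the caller no longer needs each
 * source. Returns 1 or 2 to reuse that source as the destination, or 0
 * to signal that a fresh destination vector is required. */
int choose_inplace_dest(unsigned int mask, int src1_dead, int src2_dead)
{
    if (src1_dead && (mask & INPLACE_1)) return 1;
    if (src2_dead && (mask & INPLACE_2)) return 2;
    return 0;
}
```

A source is reused only when the operation permits it and the caller has no further use for it. Preferring the first source when both qualify is an arbitrary choice in this sketch.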

The CVL library functions are divided into a number of different classes:

elementwise  The elementwise operations apply a function either to every element of a vector
    or to corresponding elements of several different vectors of the same length. Examples of
    elementwise operations include negating each element of a vector and adding the corresponding
    elements of two vectors.

reduce  The reduce functions combine all elements of a vector together under the operation of some
    other function such as addition or minimum. Examples of reduce functions include summing
    the elements of a vector and finding the smallest element of a vector. For zero-length vectors,
    the result is the identity element for the operation.

scan  The scan instructions generalize the reduce functions by creating a vector whose ith element
    is the reduction of the first i - 1 elements of the argument vector. The first element returned
    is the identity element under the operation.

permute  The permute functions rearrange the elements of a vector of values according to a vector
    of indices. The simple permute is a scatter/send operation, while the back-permute is a
    gather/get operation.

vector-scalar  The vector-scalar operations convert vectors to scalars and back again. The extract
    function returns a specified element of a vector; the replace function replaces a specified
    element of a vector with a given value. The distribute function creates a vector of given
    length, all of whose elements have the same specified value.

facilities  The facilities operations perform needed system functions and deal with the
    representation of vectors and segment descriptors. Facility functions include vector memory
    allocation and freeing, creation of segment descriptors, and computing the size of a vector
    whose elements are of some type.

library  The library functions consist of a number of routines that could be expressed in terms of
    other CVL functions, but which, for reasons of efficiency or consistency of the CVL model,
    were implemented specially.
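The scatter/gather distinction in the permute class can be pinned down with a serial reference on ordinary C arrays. This is a sketch of the semantics only, with hypothetical function names; it says nothing about how a parallel implementation would move the data.

```c
#include <assert.h>

/* Simple (forward) permute, a scatter: dest[index[i]] = src[i].
 * index is expected to be a permutation of 0..len-1. */
void permute_ref(int *dest, const int *src, const int *index, int len)
{
    for (int i = 0; i < len; i++) {
        assert(index[i] >= 0 && index[i] < len);
        dest[index[i]] = src[i];
    }
}

/* Back-permute, a gather: dest[i] = src[index[i]].
 * The destination and index share a length that may differ from the
 * length of the source. */
void back_permute_ref(int *dest, const int *src, const int *index,
                      int src_len, int dest_len)
{
    for (int i = 0; i < dest_len; i++) {
        assert(index[i] >= 0 && index[i] < src_len);
        dest[i] = src[index[i]];
    }
}
```

When index is a proper permutation, gathering with the same index vector undoes the scatter, which is a convenient sanity check for an implementation.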
5 Conclusion

We are interested in any comments or bug reports about CVL and the contents of this technical
report. Source code for our implementations of CVL is available via anonymous ftp. For further
details, please send email to blelloch@cs.cmu.edu. Future changes in CVL are likely; please send
email to find out about the latest version of the library.

A CVL Instructions

This appendix gives a description and the interface definition for each CVL instruction. Most
functions will be defined on all data types for which the operation makes sense; not all valid type
signatures are given here.

As a general rule, the first argument to a CVL function is a vec_p corresponding to the
destination argument. The contents of this vector will be overwritten with the result of the operation.
Each non-facility function takes as its final argument a vec_p that refers to a scratch space. The
required minimum size of this scratch space is returned by fun_scratch(len), where len is the
length of the vector arguments to the function fun.

A.1 Facilities

1. The size functions return the number of vector memory units required by a vector of given
type and length. In the case of segment descriptors, both the length of the segment descriptor
and the length of the unsegmented vector must be supplied.

    int siz_foz (int length)
    int siz_fob (int length)
    int siz_fod (int length)
    int siz_fos (int vec_len, int seg_count)

Note: Vectors and segments of a vector may have zero length. All CVL functions must work
properly when the length argument is zero.

2. Segment descriptors are created from a vector of segment lengths using the mke_fov function.
The inverse function is len_fos. There are scratch functions corresponding to both of these
functions.

    void mke_fov (vec_p segd, vec_p lengths, int vec_len, int seg_count,
                  vec_p scratch)

    void len_fos (vec_p lengths, vec_p segd, int vec_len, int seg_count,
                  vec_p scratch)

    int mke_fov_scratch (int vec_len, int seg_count)
    int len_fos_scratch (int vec_len, int seg_count)

Here, lengths is an integer vector containing segment lengths, segd is a segment descriptor,
vec_len is the length of the unsegmented vector (the sum of the values in lengths), and
seg_count is the number of segments (number of elements in lengths).

3. CVL provides memory allocation and freeing instructions. As explained earlier, these
functions may only be invoked once per program execution.

    vec_p alo_foz (int size)
    void fre_fov (vec_p vec)

alo_foz tries to allocate size units of vector memory. The allocation function returns
(vec_p)NULL if the allocation fails.

4. CVL provides a move vector instruction:

    void mov_fov (vec_p dest, vec_p src, int size, vec_p scratch)
    int mov_fov_scratch (int size)

This instruction moves a block of vector memory from one location to another. The
implementation of mov_fov must allow situations in which the source and destination vectors
overlap.

5. CVL provides the following arithmetic functions on the vec_p type:

    vec_p add_fov (vec_p v, int a)
    int sub_fov (vec_p v1, vec_p v2)
    cvl_bool eql_fov (vec_p v1, vec_p v2)
    int cmp_fov (vec_p v1, vec_p v2)

The add function takes a vec_p v and an integer a and returns a new vec_p corresponding
to a region of vector memory that is guaranteed not to overlap a vector with vec_p v and
size a. The subtract function takes two vec_p arguments and returns the size of the largest
vector that may be stored at the first argument without overlapping a vector stored at the
second. The equality function returns 1 if the two arguments refer to the same region of
vector memory^2 and returns 0 otherwise. The compare function compares two vec_ps and
returns a positive value if v1 corresponds to a region of vector memory after that of v2,
returns 0 if the vec_ps are equal (in the sense above), and returns a negative value otherwise.

Implementation note: The actual pointer to vector memory (either the vec_p or one of its
components) must be "maximally aligned." In other words, given a vec_p, a vector of any
type must be storable in the corresponding memory block. Without this property, it might
not be possible, for example, to move an integer vector into the location once held by a
boolean vector. The implementations of alo_foz and add_fov must guarantee this property
of the vec_p returned.

6. CVL includes two timing functions for benchmarking purposes. The get time function,
tgt_fos, stores a time stamp in a structure of type cvl_timer_t. The time difference
function, tdf_fos, takes two such structures and returns a double precision count of the number
of seconds between the two events.

    void tgt_fos (cvl_timer_t *time)
    double tdf_fos (cvl_timer_t *t1, cvl_timer_t *t2)

7. To provide a convenient interface between normal C arrays and the CVL vectors, CVL provides
translation functions that convert between the two:

    void v2c_fuz (int *dest, vec_p src, int len, vec_p scratch)
    void c2v_fuz (vec_p dest, int *src, int len, vec_p scratch)

v2c writes the contents of a vector into a unit-stride C array; c2v performs the inverse
operation. These instructions exist for boolean, integer and double precision types. They do
not exist for segment descriptors; these must first be converted to/from length vectors.


^2 This does not necessarily mean that v1 == v2. One possible implementation of a vec_p is as a
pointer to a structure, within which is a pointer into vector memory; eql_fov should return 1 if the
pointer fields of the structure are equal.
A.2 Elementwise functions

All source and destination arguments to elementwise functions must be of the same length, which is
supplied as a parameter. The same elementwise function works on either segmented or unsegmented
vectors, with no segment descriptor required.

The logical functions and, ior, and not are defined on both integers and booleans. They act
as bitwise operators on integers and as logical operators on booleans.

Table 2 gives a list of all the elementwise functions provided by CVL.

1. The unary elementwise functions include the type conversion routines (int, dbl, and boo),
various arithmetic functions (flr, cei, trn, rou, log, exp, sqt), negation, and copy. cpy_wus
takes both segment count and vector length arguments.

    void not_wuz (vec_p dest, vec_p src, int len, vec_p scratch)

    void cpy_wuz (vec_p dest, vec_p src, int len, vec_p scratch)

    void cpy_wus (vec_p dest, vec_p src, int vec_len, int seg_count,
                  vec_p scratch)

2. The binary elementwise functions include all the standard arithmetic, logical, and comparison
functions.

    void add_wuz (vec_p dest, vec_p src1, vec_p src2, int len, vec_p scratch)

3. The only ternary function is select. The select operations take three vector source
arguments: a boolean vector and two vectors of the same type.

    void sel_wub (vec_p dest, vec_p bool_vec, vec_p s1, vec_p s2, int len,
                  vec_p scratch)

The result vector has value:

    dest[i] = (bool_vec[i] ? s1[i] : s2[i])
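A serial reference for select, together with the two meanings of not (bitwise on integers, logical on booleans), might look like the sketch below. It works on ordinary C arrays and assumes, purely for illustration, that boolean elements are stored as ints holding 0 or 1; the type bool_elt and the function names are hypothetical.

```c
#include <assert.h>

typedef int bool_elt;   /* assumed boolean element representation: 0 or 1 */

/* Select: dest[i] = flags[i] ? s1[i] : s2[i]. */
void select_ref(int *dest, const bool_elt *flags,
                const int *s1, const int *s2, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        dest[i] = flags[i] ? s1[i] : s2[i];
}

/* not on integer vectors is the bitwise complement. */
void not_int_ref(int *dest, const int *src, int len)
{
    for (int i = 0; i < len; i++)
        dest[i] = ~src[i];
}

/* not on boolean vectors is logical negation. */
void not_bool_ref(bool_elt *dest, const bool_elt *src, int len)
{
    for (int i = 0; i < len; i++)
        dest[i] = !src[i];
}
```

Keeping the integer and boolean versions as separate functions mirrors the way CVL distinguishes them by the final letter of the function name.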


A.3 Scan and Reduce

The reduce functions combine all the elements in the source vector, using the identity element
as the initial combining element. The unsegmented reduce returns this value; the segmented
version combines each segment independently and returns all these results in the destination vector
argument.

    int add_ruz (vec_p src, int len, vec_p scratch)
    void add_rez (vec_p dest, vec_p src, vec_p segd, int vec_len,
                  int seg_count, vec_p scratch)

The scan functions compute a running reduction of the source vector and return this result in
a destination argument. As with the reduce functions, the identity element is the initial combiner.

    void add_suz (vec_p dest, vec_p src, int len, vec_p scratch)
    void add_sez (vec_p dest, vec_p src, vec_p segd, int vec_len,
                  int seg_count, vec_p scratch)

Scan and reduce functions are provided for addition, subtraction, multiplication, or, and, xor,
maximum, and minimum.
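The semantics described above (identity element first, so the scan is exclusive) can be written down as a serial reference on ordinary C arrays. This sketch uses addition and represents segmentation by a plain array of segment lengths, which is only one possible descriptor representation. The function names are hypothetical, and a parallel implementation would be organized very differently.

```c
#include <assert.h>

/* Unsegmented plus-reduce; returns the identity (0) for an empty vector. */
int plus_reduce_ref(const int *src, int len)
{
    assert(len >= 0);
    int acc = 0;
    for (int i = 0; i < len; i++)
        acc += src[i];
    return acc;
}

/* Unsegmented exclusive plus-scan: dest[i] = src[0] + ... + src[i-1]. */
void plus_scan_ref(int *dest, const int *src, int len)
{
    int acc = 0;                  /* identity is the first element written */
    for (int i = 0; i < len; i++) {
        int x = src[i];           /* read first so dest may alias src */
        dest[i] = acc;
        acc += x;
    }
}

/* Segmented exclusive plus-scan: restart from the identity in each segment. */
void seg_plus_scan_ref(int *dest, const int *src,
                       const int *lengths, int seg_count)
{
    int pos = 0;
    for (int s = 0; s < seg_count; s++) {
        plus_scan_ref(dest + pos, src + pos, lengths[s]);
        pos += lengths[s];
    }
}
```

For the vector [7 2 9 8 4 5 6 3], the unsegmented scan gives [0 7 9 18 26 30 35 41], while the segmented scan with segment lengths [3 5] gives [0 7 9 0 8 12 17 23].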
mnemonic  function          C equivalent
add       addition          d = a + b
sub       subtraction       d = a - b
eql       equal             d = a == b
cpy       copy              d = a
min       minimum           d = min(a, b)
max       maximum           d = max(a, b)
mul       multiplication    d = a * b
div       division          d = a / b
mod       modulus           d = a % b
rnd       random            d = random() % a
lsh       left shift        d = a << b
rsh       right shift       d = a >> b
grt       greater than      d = a > b
les       less than         d = a < b
geq       grt than or eq    d = a >= b
leq       less than or eq   d = a <= b
neq       not equal         d = a != b
not       not               d = !a, d = ~a
ior       inclusive or      d = a || b, d = a | b
and       and               d = a && b, d = a & b
xor       exclusive or      d = a ^ b
log       natural log       d = log(a)
exp       exp               d = exp(a)
sqt       sqrt              d = sqrt(a)
sin       sine              d = sin(a)
cos       cosine            d = cos(a)
tan       tangent           d = tan(a)
asn       arcsin            d = asin(a)
acs       arccosine         d = acos(a)
atn       arctangent        d = atan(a)
snh       sinh              d = sinh(a)
csh       cosh              d = cosh(a)
tnh       tanh              d = tanh(a)
sel       select            d = a ? b : c
int       integer           d = (int)(a)
dbl       double            d = (double)(a)
boo       bool              d = !!(a)
flr       floor             d = (int)floor(a)
cei       ceiling           d = (int)ceil(a)
trn       truncate          d = (int)(a)
rou       round             d = (int)rint(a)

Table 2: List of elementwise CVL functions. In each case, d refers to the destination vector, and
a, b, and c are the argument vectors. For ior, and, and not, the two versions are for booleans
and integers, respectively. The semantics of each elementwise function are equivalent to those
defined by ANSI C for the C version in this table.
A.4 Permute

The permute functions take source and index vectors and write the result into a destination vector.
Simple and backward permutes, both segmented and unsegmented, are part of the basic library.
Also provided are backward and forward flag permutes, and forward default and default-flag
permutes. For those permute operations with differing source and destination lengths, two sets of
length or segment descriptor arguments must be provided. Any constraints on vector arguments
given below must also be satisfied by each segment in the segmented version of the operation.

1. Simple (or forward) permute puts an element into the location in the destination given by the
index vector: dest[index[i]] = src[i]. The elements of index must be a proper permutation
(all indices present, and no repetitions) of the allowable range. All vector arguments
must be the same length.

    void smp_puz (vec_p dest, vec_p src, vec_p index, int len,
                  vec_p scratch)

    void smp_pez (vec_p dest, vec_p src, vec_p index, vec_p segd,
                  int vec_len, int seg_count, vec_p scratch)

2. Backward permute gets an element from the indexed location:

dest[i] = src[index[i]]

The destination and index vectors must be of the same length, which can differ from that of
the source.

void bck_puz (vec_p dest, vec_p src, vec_p index,
              int src_len, int dest_len, vec_p scratch)

void bck_pez (vec_p dest, vec_p src, vec_p index,
              vec_p src_segd, int src_vec_len, int src_seg_count,
              vec_p dest_segd, int dest_vec_len, int dest_seg_count,
              vec_p scratch)

3.  The  default  permute  operation  is  a  simple  permute  that  relaxes  the  restriction  that  the  source 
and  destination  must  be  of  the  same  length.  Any  unassigned  position  in  the  destination  gets 
its  value  from  the  corresponding  position  of  a  default  vector. 

void dpe_puz (vec_p dest, vec_p src, vec_p index, vec_p default,
              int src_len, int dest_len, vec_p scratch)

void dpe_pez (vec_p dest, vec_p src, vec_p index, vec_p default,
              vec_p src_segd, int src_vec_len, int src_seg_count,
              vec_p dest_segd, int dest_vec_len, int dest_seg_count,
              vec_p scratch)

4. The flag permute is a combination of the select and permute operations: a flag vector
determines whether or not each element is moved. The set of indices and values not masked
are the arguments to the appropriate permute operation. For example, in the simple flag
permute (fpm):

if (flags[i]) { dest[index[i]] = src[i]; }

In the backward flag permute (bfp), unfilled positions in the destination vector are set to 0
or false. There are simple (fpm), backward (bfp), and default (dpf) flag permute operations.

void fpm_puz (vec_p dest, vec_p src, vec_p index, vec_p flags,
              int src_len, int dest_len, vec_p scratch)


void fpm_pez (vec_p dest, vec_p src, vec_p index, vec_p flags,
              vec_p src_segd, int src_vec_len, int src_seg_count,
              vec_p dest_segd, int dest_vec_len, int dest_seg_count,
              vec_p scratch)

void bfp_puz (vec_p dest, vec_p src, vec_p index, vec_p flags,
              int src_len, int dest_len, vec_p scratch)
void bfp_pez (vec_p dest, vec_p src, vec_p index, vec_p flags,
              vec_p src_segd, int src_vec_len, int src_seg_count,
              vec_p dest_segd, int dest_vec_len, int dest_seg_count,
              vec_p scratch)

void dpf_puz (vec_p dest, vec_p src, vec_p index, vec_p flags,
              vec_p default, int src_len, int dest_len,
              vec_p scratch)

void dpf_pez (vec_p dest, vec_p src, vec_p index, vec_p flags,
              vec_p default,
              vec_p src_segd, int src_vec_len, int src_seg_count,
              vec_p dest_segd, int dest_vec_len, int dest_seg_count,
              vec_p scratch)
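
The semantics of the permute operations above can be illustrated with serial reference loops. The following is a minimal sketch on plain C int arrays rather than vec_p handles. All function names are hypothetical and not part of CvL. The sketch assumes that a segment descriptor can be modelled as an array of segment lengths, that segmented indices are relative to the start of their own segment, that the default vector has the destination length, and that the simple flag permute leaves unflagged destination positions untouched.

```c
/* Hypothetical serial references; none of these names belong to CvL. */

/* Simple (forward) permute: index must be a permutation of 0..len-1. */
void ref_smp(int *dest, const int *src, const int *index, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dest[index[i]] = src[i];
}

/* Segmented simple permute. Assumes seg_lens[s] is the length of
 * segment s and each index is relative to its segment's start. */
void ref_smp_seg(int *dest, const int *src, const int *index,
                 const int *seg_lens, int seg_count)
{
    int s, base = 0;
    for (s = 0; s < seg_count; s++) {
        ref_smp(dest + base, src + base, index + base, seg_lens[s]);
        base += seg_lens[s];
    }
}

/* Backward permute: dest and index both have dest_len elements. */
void ref_bck(int *dest, const int *src, const int *index, int dest_len)
{
    int i;
    for (i = 0; i < dest_len; i++)
        dest[i] = src[index[i]];
}

/* Default permute: positions the permute does not write keep the
 * value from defaults (assumed to have dest_len elements). */
void ref_dpe(int *dest, const int *src, const int *index,
             const int *defaults, int src_len, int dest_len)
{
    int i;
    for (i = 0; i < dest_len; i++)
        dest[i] = defaults[i];
    for (i = 0; i < src_len; i++)
        dest[index[i]] = src[i];
}

/* Simple flag permute: only flagged elements are moved. */
void ref_fpm(int *dest, const int *src, const int *index,
             const int *flags, int src_len)
{
    int i;
    for (i = 0; i < src_len; i++)
        if (flags[i])
            dest[index[i]] = src[i];
}

/* Backward flag permute: unflagged positions are set to 0. */
void ref_bfp(int *dest, const int *src, const int *index,
             const int *flags, int dest_len)
{
    int i;
    for (i = 0; i < dest_len; i++)
        dest[i] = flags[i] ? src[index[i]] : 0;
}
```

The forward variants scatter (the index selects where a value is written), while the backward variants gather (the index selects where a value is read), which is why only the backward forms tolerate repeated indices.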

A.5  Vector-scalar

Extract,  replace,  and  distribute  functions  exist  in  both  segmented  and  unsegmented  versions.  Ex¬ 
tract  is  an  indexing  function  that  removes  the  requested  element  from  a  vector  and  returns  a  scalar 
(unsegmented  version)  or  an  unsegmented  vector  (segmented  version).  Replace  is  the  inverse  oper¬ 
ation.  It  is  a  destructive  operation,  modifying  the  contents  of  the  destination  vector.  The  output 
of distribute is a vector of given length, all of whose elements have a given value.

void dis_vuz (vec_p dest, int value, int len, vec_p scratch)
void dis_vez (vec_p dest, vec_p value, vec_p dest_segd,
              int dest_vec_len, int dest_seg_count, vec_p scratch)

int ext_vuz (vec_p src, int index, int len, vec_p scratch)
void ext_vez (vec_p dest, vec_p src, vec_p index, vec_p src_segd,
              int src_vec_len, int src_seg_count, vec_p scratch)

void rep_vuz (vec_p src, int index, int value, int len, vec_p scratch)
void rep_vez (vec_p dest, vec_p src, vec_p value, vec_p segd,
              int vec_len, int seg_len, vec_p scratch)
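
The behaviour of distribute, extract, and replace can be sketched serially as follows. This is an illustration on plain C int arrays, not the library's implementation; the function names are hypothetical, and the segmented extract assumes that the segment descriptor is an array of segment lengths and that each index is relative to the start of its segment.

```c
/* Hypothetical serial references; none of these names belong to CvL. */

/* Distribute: fill dest with len copies of value. */
void ref_dis(int *dest, int value, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dest[i] = value;
}

/* Unsegmented extract: return the element at index as a scalar. */
int ref_ext(const int *src, int index)
{
    return src[index];
}

/* Replace: destructively overwrite one element of vec. */
void ref_rep(int *vec, int index, int value)
{
    vec[index] = value;
}

/* Segmented extract: one element per segment, giving an unsegmented
 * result of seg_count elements. Assumes seg_lens[s] is the length of
 * segment s and index[s] is relative to that segment's start. */
void ref_ext_seg(int *dest, const int *src, const int *index,
                 const int *seg_lens, int seg_count)
{
    int s, base = 0;
    for (s = 0; s < seg_count; s++) {
        dest[s] = src[base + index[s]];
        base += seg_lens[s];
    }
}
```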

A.6  Library

CvL  also  contains  a  set  of  library  functions.  These  functions  could  be  implemented  in  terms  of  the 
other  primitives,  but  for  efficiency,  may  be  implemented  directly. 

1. The pack primitive has been divided into two parts, pk1 and pk2.^ The first part of pack,
pk1_lev (this function is independent of element type), takes a flag vector and returns a
value (vector) giving the number of true elements (in each segment).

^This was done in order to facilitate memory management. The first part of the operation provides information
required to obtain a vec_p for the results of the second part.
int pk1_luv (vec_p flags, int vec_len, vec_p scratch)
void pk1_lev (vec_p dest, vec_p flags, vec_p segd,
              int vec_len, int seg_count, vec_p scratch)

If flags = [T F] [F F T T], then after calling pk1_lev, the value of dest is [1 2]. The
pk2 functions fill a destination vector with elements corresponding to the true elements
of the flag vector.

void pk2_luz (vec_p dest, vec_p src, vec_p flags,
              int src_len, int dest_len, vec_p scratch)
void pk2_lez (vec_p dest, vec_p src, vec_p flags,
              vec_p src_segd, int src_vec_len, int src_seg_count,
              vec_p dest_segd, int dest_vec_len, int dest_seg_count,
              vec_p scratch)

It  is  the  responsibility  of  the  calling  function  to  use  the  result  of  the  first  pack  step  to  allocate 
the  destination  vector  and  destination  segment  descriptor. 

2.  The  index  function  generates  a  vector  of  integer  values,  starting  from  a  given  initial  value, 
with  given  stride,  and  generating  a  given  number  of  values.  CvL  provides  both  segmented 
and  unsegmented  index  functions. 

void ind_luz (vec_p dest, int init, int stride, int count, vec_p scratch)
void ind_lez (vec_p dest, vec_p init, vec_p stride, vec_p count,
              vec_p dest_segd, int dest_vec_len, int dest_seg_count,
              vec_p scratch)

The  arguments  to  the  index  functions  are  similar  to  those  of  the  distribute  functions. 

3.  CvL  provides  rank  functions  for  sorting  vectors  of  doubles  or  integers.  Rank  returns  a  per¬ 
mutation  indicating  how  the  source  elements  are  ordered.  For  segmented  rank,  the  result 
restarts  at  0  for  each  segment,  i.e.  each  segment  contains  a  permutation. 

rku_luz (vec_p rank, vec_p src, int len, vec_p scratch)
rku_lez (vec_p rank, vec_p src, vec_p segd, int vec_len,
         int seg_count, vec_p scratch)

The  rku  functions  perform  an  upward  rank  (lowest  element  gets  rank  0);  the  rkd  functions 
perform  a  downward  rank  (highest  element  gets  rank  0).  The  source  is  an  integer  or  double 
vector.  The  result  is  always  an  integer  vector.  The  rank  is  stable. 
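
The library functions described above can be expressed in terms of simple serial loops, which may help clarify their semantics. The following is a minimal sketch on plain C arrays; the function names are hypothetical and not part of CvL, the segment descriptor is assumed to be an array of segment lengths, and the quadratic rank loop is chosen for clarity rather than speed.

```c
/* Hypothetical serial references; none of these names belong to CvL. */

/* First pack step: count the true flags. */
int ref_pk1(const int *flags, int len)
{
    int i, n = 0;
    for (i = 0; i < len; i++)
        if (flags[i])
            n++;
    return n;
}

/* Segmented first pack step: one count per segment.
 * Assumes seg_lens[s] is the length of segment s. */
void ref_pk1_seg(int *dest, const int *flags,
                 const int *seg_lens, int seg_count)
{
    int s, base = 0;
    for (s = 0; s < seg_count; s++) {
        dest[s] = ref_pk1(flags + base, seg_lens[s]);
        base += seg_lens[s];
    }
}

/* Second pack step: copy the flagged elements, in order.
 * dest must have room for ref_pk1(flags, len) elements. */
void ref_pk2(int *dest, const int *src, const int *flags, int len)
{
    int i, n = 0;
    for (i = 0; i < len; i++)
        if (flags[i])
            dest[n++] = src[i];
}

/* Index: init, init + stride, init + 2*stride, ... (count values). */
void ref_ind(int *dest, int init, int stride, int count)
{
    int i;
    for (i = 0; i < count; i++)
        dest[i] = init + i * stride;
}

/* Stable upward rank: rank[i] is the position src[i] would take in
 * ascending order; equal elements keep their original order. */
void ref_rku(int *rank, const double *src, int len)
{
    int i, j;
    for (i = 0; i < len; i++) {
        int r = 0;
        for (j = 0; j < len; j++)
            if (src[j] < src[i] || (src[j] == src[i] && j < i))
                r++;
        rank[i] = r;
    }
}
```

A caller would use the count from the first pack step to size the destination before running the second step. Because the rank is a proper permutation, using it as the index vector of a simple permute places the source elements in sorted order.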

B  Example:  Dot  Product 

We give an example of C code that uses CvL to calculate the dot product of two vectors. The code
is shown in Figure 1. Two vectors are generated by function calls inside the dot product function.*
We could have eliminated some of the scratch checking by precomputing how much memory is
needed (since this depends only on vector length and which CvL operations are performed) before
doing the allocation.

*These vectors would typically be passed as arguments, but we wanted to show how vector memory is allocated.
/* Complete CVL example: dot product of double precision vectors.
 *
 * This routine takes a vector length as input. It allocates vector memory, calls two (dummy) functions
 * to generate the operands for a dot product, computes the dot product, frees memory, performs (some)
 * error checking, and returns the value of the dot product. This has all the gory details!
 *
 * We assume that the dummy functions take three arguments: a vector length, a vec_p in which to
 * store the result, and some scratch space.
 */

#include <stdio.h>    /* fprintf */
#include <stdlib.h>   /* exit */
#include <cvl.h>      /* This contains all CVL declarations */

void check_scratch(vec_p vmem, vec_p scratch, int vmem_size, int scratch_needed);

double dotp(int len)
{
    vec_p vmem;                    /* vector memory: result of alo_foz */
    int vsize = siz_fod(len);      /* size of a double vector of length len */
    int vmem_size = ...;           /* amount of vector memory to allocate */
    vec_p a, b;                    /* two vector arguments */
    vec_p product;                 /* elementwise product of a and b */
    vec_p scratch;                 /* scratch storage for CVL */
    int scratch_needed;            /* amount of scratch needed by CVL function */
    double result;                 /* final result */

    vmem = alo_foz(vmem_size);     /* Allocate memory for vectors */
    if (vmem == (vec_p) NULL) {    /* check for failure */
        fprintf(stderr, "dotp: cvl could not allocate memory\n");
        exit(1);
    }

    a = vmem;                            /* store a at beginning */
    b = add_fov(a, vsize);               /* store b right after a */
    product = add_fov(b, vsize);         /* result of multiply goes here */
    scratch = add_fov(product, vsize);   /* rest is for scratch */

    get_a(len, a, scratch);              /* create first vector */
    get_b(len, b, scratch);              /* create second vector */

    /* check if there is enough scratch space for the multiply operation */
    scratch_needed = mul_wud_scratch(len);
    check_scratch(vmem, scratch, vmem_size, scratch_needed);
    mul_wud(product, a, b, len, scratch);      /* elementwise multiply */

    /* check for enough scratch to do add reduce */
    scratch_needed = add_rud_scratch(len);
    check_scratch(vmem, scratch, vmem_size, scratch_needed);
    result = add_rud(product, len, scratch);   /* add reduce */

    fre_fov(vmem);                             /* free up memory */

    return result;
}

/* check_scratch is a useful utility function for verifying that enough scratch space has been reserved.
 * It assumes that the scratch vector is at the end of vector memory.
 * exit() is called if an error is encountered.
 */
void check_scratch(vec_p vmem, vec_p scratch, int vmem_size, int scratch_needed)
{
    if ((vmem_size - sub_fov(scratch, vmem)) < scratch_needed) {
        fprintf(stderr, "Not enough scratch space.");
        exit(1);
    }
}

Figure 1: CvL code for dot product. This example demonstrates memory allocation, scratch space
checking and basic CvL function calls.
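
For comparison, the computation performed by Figure 1 reduces to an elementwise multiply followed by an add reduce. The following serial sketch shows that two-step structure on plain C double arrays; the names ref_mul, ref_add_reduce, and ref_dotp are hypothetical and are not CvL functions, and the caller is assumed to supply the buffer for the intermediate product.

```c
/* Hypothetical serial references; none of these names belong to CvL. */

/* Elementwise multiply: d[i] = a[i] * b[i]. */
void ref_mul(double *d, const double *a, const double *b, int len)
{
    int i;
    for (i = 0; i < len; i++)
        d[i] = a[i] * b[i];
}

/* Add reduce: sum of all elements (0.0 for an empty vector). */
double ref_add_reduce(const double *a, int len)
{
    int i;
    double sum = 0.0;
    for (i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

/* Dot product as multiply followed by reduce; product is a
 * caller-supplied buffer of len doubles for the intermediate result. */
double ref_dotp(const double *a, const double *b, double *product, int len)
{
    ref_mul(product, a, b, len);
    return ref_add_reduce(product, len);
}
```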


C  Changes 

This section lists some of the changes between this and older versions of the CvL documentation.

change  The  general  scan  operation  is  no  longer  directly  supported,  and  the  ***_n**  instructions  have 
all  been  renamed  to  ***  j**. 

new  Scan  and  reduce  instructions  for  xor  and  multiplication  have  been  added. 

new The mov_fov instruction has been added and the requirement that cpy_wu* handle overlapping
vectors has been removed.

new  The  v2c  and  c2v  instructions  have  been  added. 

new  The  index  and  rank  library  instructions  have  been  added. 

change The inplace mechanism has been changed.

change The or_*** function has been renamed to ior_***; all functions now have a three letter root.